Smart Paper Notes

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Teven Le Scao , Angela Fan , Christopher Akiki , Ellie Pavlick , Suzana Ilić , Daniel Hesslow , Roman Castagné , Alexandra Sasha Luccioni , François Yvon , Matthias Gallé

Category: Natural Language Processing

2022-11-09

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

Translated by Google Translate

Neural Data-to-Text Generation Based on Small Datasets: Comparing the Added Value of Two Semi-Supervised Learning Approaches on Top of a Large Language Model

Chris van der Lee , Thiago Castro Ferreira , Chris Emmery , Travis Wiltshire , Emiel Krahmer

Category: Natural Language Processing

2022-07-14

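This study discusses the effect of semi-supervised learning in combination with pretrained language models for data-to-text generation. It is not known whether semi-supervised learning is still helpful when a large-scale language model is also supplemented. This study aims to answer this question by comparing a data-to-text system supplemented only with a language model to two data-to-text systems that are additionally enriched by a data augmentation or a pseudo-labeling semi-supervised learning approach. Results show that semi-supervised learning leads to higher scores on diversity metrics. In terms of output quality, extending the training set of a data-to-text system using the pseudo-labeling approach did increase text quality scores, but the data augmentation approach yielded scores similar to those of the system without training set extension. These results indicate that semi-supervised learning approaches can bolster output quality and diversity, even when a language model is also present.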
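
The pseudo-labeling setup compared in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' system: the template-based `fit` and `generate` functions are hypothetical stand-ins for fine-tuning and decoding with a large language model, and the records are made up.

```python
from collections import Counter

def fit(pairs):
    """Toy 'training': delexicalise each reference text and keep the most
    frequent template. Hypothetical stand-in for fine-tuning a language model."""
    templates = Counter()
    for record, text in pairs:
        for key, value in record.items():
            text = text.replace(str(value), "{" + key + "}")
        templates[text] += 1
    return templates.most_common(1)[0][0]

def generate(template, record):
    """Toy 'decoding': fill the learned template with the record's values."""
    return template.format(**record)

def extend_with_pseudo_labels(labeled, unlabeled):
    """Pseudo-labeling: train on the labeled pairs, generate texts for the
    unlabeled records, then retrain on the union of both sets."""
    model = fit(labeled)
    pseudo = [(record, generate(model, record)) for record in unlabeled]
    extended = labeled + pseudo
    return fit(extended), extended

# Made-up data records with reference texts (illustrative only).
labeled = [
    ({"team": "Ajax", "goals": 3}, "Ajax scored 3 goals."),
    ({"team": "PSV", "goals": 4}, "PSV scored 4 goals."),
]
unlabeled = [{"team": "Feyenoord", "goals": 2}]
model, extended = extend_with_pseudo_labels(labeled, unlabeled)
```

The key design point is that the pseudo-labeled pairs are produced by the model itself, so their quality is bounded by the first-round model; the data augmentation alternative would instead perturb the existing labeled pairs.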

Combining Multi-Fidelity Modelling and Asynchronous Batch Bayesian Optimization

Jose Pablo Folch , Robert M Lee , Behrang Shafei , David Walz , Calvin Tsay , Mark van der Wilk , Ruth Misener

Categories: Machine Learning | Machine Learning (Statistics)

2022-11-11

Bayesian Optimization is a useful tool for experiment design. Unfortunately, the classical, sequential setting of Bayesian Optimization does not translate well into laboratory experiments, for instance battery design, where measurements may come from different sources and their evaluations may require significant waiting times. Multi-fidelity Bayesian Optimization addresses the setting with measurements from different sources. Asynchronous batch Bayesian Optimization provides a framework to select new experiments before the results of the prior experiments are revealed. This paper proposes an algorithm combining multi-fidelity and asynchronous batch methods. We empirically study the algorithm behavior, and show it can outperform single-fidelity batch methods and multi-fidelity sequential methods. As an application, we consider designing electrode materials for optimal performance in pouch cells using experiments with coin cells to approximate battery performance.

Where is VALDO? VAscular Lesions Detection and segmentatiOn challenge at MICCAI 2021

Carole H. Sudre , Kimberlin Van Wijnen , Florian Dubost , Hieab Adams , David Atkinson , Frederik Barkhof , Mahlet A. Birhanu , Esther E. Bron , Robin Camarasa , Nish Chaturvedi

Categories: Computer Vision | Artificial Intelligence

2022-08-15

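Imaging markers of cerebral small vessel disease provide valuable information on brain health, but their manual assessment is time-consuming and hampered by substantial intra- and inter-rater variability. Automated rating may benefit biomedical research as well as clinical assessment, but the diagnostic reliability of existing algorithms is unknown. Here, we present the VAscular Lesions Detection and segmentatiOn (Where is VALDO?) challenge, which was run as a satellite event at the international conference on Medical Image Computing and Computer Aided Intervention (MICCAI) 2021. This challenge aimed to promote the development of methods for automated detection and segmentation of small and sparse imaging markers of cerebral small vessel disease, namely enlarged perivascular spaces (EPVS) (Task 1), cerebral microbleeds (Task 2), and lacunes of presumed vascular origin (Task 3), while leveraging weak and noisy labels. Overall, 12 teams participated in the challenge, proposing solutions for one or more tasks (4 for Task 1 - EPVS, 9 for Task 2 - Microbleeds, and 6 for Task 3 - Lacunes). Multi-cohort data was used in both training and evaluation. Results showed a large variability in performance both across teams and across tasks, with promising results notably for Task 1 - EPVS and Task 2 - Microbleeds, and no practically useful results yet for Task 3 - Lacunes. The challenge also highlighted performance inconsistency across cases that may deter use at an individual level, while still proving useful at a population level.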

Applying Machine Learning to Crowd-sourced Data from Earthquake Detective

Omkar Ranadive , Suzan van der Lee , Vivian Tang , Kevin Chao

Category: Machine Learning

2020-11-05

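Dynamically triggered earthquakes and tremor generate two classes of weak seismic signals whose detection, identification, and authentication traditionally call for laborious analyses. Machine learning (ML) has grown in recent years into a powerful efficiency-boosting tool in geophysical analyses, including the detection of specific signals in time series. However, detecting weak signals buried in noise challenges ML algorithms, in part because ubiquitous training data is not always available. Under these circumstances, ML can be as ineffective as human experts are inefficient. At this intersection of effectiveness and efficiency, we leverage a third tool that has grown in popularity over the past decade: citizen science. The citizen science project Earthquake Detective leverages the eyes and ears of volunteers to detect and classify weak signals in seismograms from potentially dynamically triggered (PDT) events. Here, we present the Earthquake Detective dataset, a crowd-sourced set of labels on PDT earthquakes and tremor. We apply machine learning to classify these PDT seismic events and explore the challenges faced in segregating and classifying such signals. We confirm that, with an image- and wavelet-based algorithm, machine learning can detect signals from small earthquakes. In addition, we report that our ML algorithm can also detect signals from PDT tremor, which has not been demonstrated previously. The citizen science dataset of classifications and the ML code are available online.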

Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction

Jens-Joris Decorte , Jeroen Van Hautte , Johannes Deleu , Chris Develder , Thomas Demeester

Category: Natural Language Processing

2022-09-13

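Skills play a central role in the job market and in many human resources (HR) processes. In the wake of other digital experiences, today's online job market has candidates expecting to see the right opportunities based on their skill set. Similarly, enterprises increasingly need to use data to guarantee that the skills within their workforce remain future-proof. However, structured information about skills is often missing, and processes that build on self- or manager-assessment have been shown to struggle with the adoption, completeness, and freshness of the resulting data. Extracting skills is a highly challenging task, given the many thousands of possible skill labels that are mentioned either explicitly or only implicitly, and the lack of finely annotated training corpora. Previous work on skill extraction either oversimplifies the task into an explicit entity detection task or builds on manually annotated training data, which would be infeasible if applied to a complete vocabulary of skills. We propose an end-to-end system for skill extraction based on distant supervision through literal matching. We propose and evaluate several negative sampling strategies, tuned on a small validation dataset, to improve the generalization of skill extraction towards implicitly mentioned skills, despite the lack of such implicit skills in the distantly supervised data. We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements, and that combining three different strategies in one model further increases performance, by up to 8 percentage points in RP@5. We introduce a manually annotated evaluation benchmark for skill extraction based on the ESCO taxonomy, on which we validate our models. We release the benchmark dataset for research purposes to stimulate further research on the task.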
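
To make the idea of selecting negative examples from related skills concrete, here is a minimal sketch. It rests on assumptions not taken from the paper: the miniature `TAXONOMY` and its labels are hypothetical, and "related" is taken to mean "listed under the same group", which may differ from the relations the authors actually use in ESCO.

```python
import random

# Hypothetical miniature taxonomy: group -> skills. The real ESCO taxonomy is
# far larger; these labels are illustrative only.
TAXONOMY = {
    "programming": ["Python", "Java", "SQL"],
    "communication": ["public speaking", "negotiation"],
    "design": ["typography", "UX research"],
}

def related_negatives(skill, k, rng):
    """Pick up to k negative skills that share a group with `skill`
    (harder negatives, since they are semantically close)."""
    siblings = [s for group in TAXONOMY.values() if skill in group
                for s in group if s != skill]
    return rng.sample(siblings, min(k, len(siblings)))

def random_negatives(skill, k, rng):
    """Baseline: pick up to k negative skills uniformly from all other skills."""
    others = [s for group in TAXONOMY.values() for s in group if s != skill]
    return rng.sample(others, min(k, len(others)))

rng = random.Random(0)
hard = related_negatives("Python", 2, rng)   # drawn only from the same group
easy = random_negatives("Python", 2, rng)    # drawn from the whole vocabulary
```

In a distantly supervised setup, sentences matched to the sampled skills would then serve as negative examples when training the classifier for the target skill.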

Improving language models by retrieving from trillions of tokens

Sebastian Borgeaud , Arthur Mensch , Jordan Hoffmann , Trevor Cai , Eliza Rutherford , Katie Millican , George van den Driessche , Jean-Baptiste Lespiau , Bogdan Damoc , Aidan Clark

Categories: Natural Language Processing | Machine Learning

2021-12-08

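We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. Our Retrieval-Enhanced Transformer (RETRO) obtains performance comparable to GPT-3 and Jurassic-1, despite using 25× fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder, and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly retrofit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.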
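
The retrieval step described in this abstract (finding corpus chunks similar to the preceding tokens) can be illustrated with a toy sketch. It is not RETRO itself: the bag-of-words `embed` function is an assumed stand-in for the frozen BERT retriever, the corpus is made up, and the cross-attention that consumes the retrieved chunks is not shown.

```python
import math
from collections import Counter

def chunks(tokens, size):
    """Split a token list into consecutive chunks of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed(chunk):
    """Stand-in embedding: bag-of-words counts (RETRO uses a frozen BERT)."""
    return Counter(chunk)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
        * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query_chunk, database, k=1):
    """Return the k database chunks most similar to the query chunk."""
    q = embed(query_chunk)
    ranked = sorted(database, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

corpus = "the cat sat on the mat while the dog slept by the door".split()
database = chunks(corpus, 4)            # retrieval database of 4-token chunks
query = "a cat on a mat".split()        # the preceding tokens to match against
neighbours = retrieve(query, database, k=1)
```

A real system would precompute the database embeddings and use an approximate nearest-neighbour index instead of sorting every chunk per query.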

Risks from Learned Optimization in Advanced Machine Learning Systems

Evan Hubinger , Chris van Merwijk , Vladimir Mikulik , Joar Skalse , Scott Garrabrant

Category: Artificial Intelligence

2019-06-05

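We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer, a situation we refer to as mesa-optimization, a neologism we introduce in this paper. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be, how will it differ from the loss function it was trained under, and how can it be aligned? In this paper, we provide an in-depth analysis of these two primary questions and an overview of topics for future research.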

Faster Maximum Inner Product Search in High Dimensions

Mo Tiwari , Ryan Kang , Je-Yong Lee , Luke Lee , Chris Piech , Sebastian Thrun , Ilan Shomorony , Martin Jinye Zhang

Categories: Machine Learning | Artificial Intelligence

2022-12-14

Maximum Inner Product Search (MIPS) is a popular problem in the machine learning literature due to its applicability in a wide array of applications, such as recommender systems. In high-dimensional settings, however, MIPS queries can become computationally expensive as most existing solutions do not scale well with data dimensionality. In this work, we present a state-of-the-art algorithm for the MIPS problem in high dimensions, dubbed BanditMIPS. BanditMIPS is a randomized algorithm that borrows techniques from multi-armed bandits to reduce the MIPS problem to a best-arm identification problem. BanditMIPS reduces the complexity of state-of-the-art algorithms from $O(\sqrt{d})$ to $O(\log d)$, where $d$ is the dimension of the problem data vectors. On high-dimensional real-world datasets, BanditMIPS runs approximately 12 times faster than existing approaches and returns the same solution. BanditMIPS requires no preprocessing of the data and includes a hyperparameter that practitioners may use to trade off accuracy and runtime. We also propose a variant of our algorithm, named BanditMIPS-$\alpha$, which employs non-uniform sampling across the data dimensions to provide further speedups.
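
The reduction from MIPS to best-arm identification can be sketched as follows. This is a generic successive-elimination illustration under stated assumptions, not the BanditMIPS algorithm as published: the confidence radius is a plain Hoeffding-style bound, `sigma` is an assumed bound on the spread of per-coordinate products, and `delta` plays the role of an accuracy/runtime knob.

```python
import math
import random

def exact_mips(atoms, query):
    """Baseline: compute every inner product in full, O(n * d) work."""
    scores = [sum(a * q for a, q in zip(atom, query)) for atom in atoms]
    return max(range(len(atoms)), key=scores.__getitem__)

def bandit_style_mips(atoms, query, batch=32, delta=0.01, sigma=1.0, seed=0):
    """Successive elimination where each atom is an arm.

    Each round adds `batch` randomly ordered coordinates to every surviving
    atom's running sum, then drops atoms whose upper confidence bound falls
    below the best lower bound.
    Returns (index of the chosen atom, number of coordinates examined).
    """
    rng = random.Random(seed)
    n, d = len(atoms), len(query)
    order = list(range(d))
    rng.shuffle(order)  # sample coordinates without replacement
    sums = [0.0] * n
    alive = list(range(n))
    used = 0
    while len(alive) > 1 and used < d:
        coords = order[used:used + batch]
        for i in alive:
            sums[i] += sum(atoms[i][j] * query[j] for j in coords)
        used += len(coords)
        radius = sigma * math.sqrt(2 * math.log(2 * n / delta) / used)
        best_lower = max(sums[i] / used for i in alive) - radius
        alive = [i for i in alive if sums[i] / used + radius >= best_lower]
    # If several atoms survive, every coordinate was used, so sums are exact.
    return max(alive, key=lambda i: sums[i]), used

# Toy data: one atom aligned with the query, the rest all zeros.
d = 1000
query = [1.0] * d
atoms = [[0.0] * d for _ in range(5)]
atoms[3] = [1.0] * d
best, used = bandit_style_mips(atoms, query)
```

On this deliberately easy input the search picks atom 3 after examining 64 of the 1,000 coordinates per surviving atom; on harder inputs it falls back to the exact inner products once all coordinates have been used.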

MABSplit: Faster Forest Training Using Multi-Armed Bandits

Mo Tiwari , Ryan Kang , Je-Yong Lee , Sebastian Thrun , Chris Piech , Ilan Shomorony , Martin Jinye Zhang

Category: Machine Learning

2022-12-14

Random forests are some of the most widely used machine learning models today, especially in domains that necessitate interpretability. We present an algorithm that accelerates the training of random forests and other popular tree-based learning methods. At the core of our algorithm is a novel node-splitting subroutine, dubbed MABSplit, used to efficiently find split points when constructing decision trees. Our algorithm borrows techniques from the multi-armed bandit literature to judiciously determine how to allocate samples and computational power across candidate split points. We provide theoretical guarantees that MABSplit improves the sample complexity of each node split from linear to logarithmic in the number of data points. In some settings, MABSplit leads to 100x faster training (a 99% reduction in training time) without any decrease in generalization performance. We demonstrate similar speedups when MABSplit is used across a variety of forest-based variants, such as Extremely Random Forests and Random Patches. We also show our algorithm can be used in both classification and regression tasks. Finally, we show that MABSplit outperforms existing methods in generalization performance and feature importance calculations under a fixed computational budget. All of our experimental results are reproducible via a one-line script at https://github.com/ThrunGroup/FastForest.